Part I: What is the Internet Movie Database (IMDb)?

IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, fan and critical reviews, and ratings. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is owned and operated by IMDb.com, Inc., a subsidiary of Amazon. As of May 2019, IMDb has approximately 6 million titles (including episodes) and 9.9 million personalities in its database, as well as 83 million registered users. IMDb began as a movie database on the Usenet group “rec.arts.movies” in 1990, and moved to the web in 1993.

Part II: Descriptive Statistics

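# Packages used throughout this report
library(data.table)            # setnames(), data.table()
library(expss)                 # tab_cells(), tab_stat_fun(), tab_pivot(), w_mean(), ...
library(ggplot2)               # histograms and box plots
library(ggpubr)                # ggarrange()
library(GGally)                # ggcorr()
library(PerformanceAnalytics)  # chart.Correlation()
library(formattable)           # formattable(), color_tile()
library(knitr)                 # kable()
library(kableExtra)            # kable_styling(), pack_rows()
library(jtools)                # summ()
library(dplyr)                 # select()

# Shorten the awkward CSV column names. setnames() renames in place,
# so Movies and MoviesL both carry the new names afterwards.
MoviesL<- setnames(Movies, old=c("Rank", 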
                          "Title",
                          "Genre",
                          "Description",
                          "Director", 
                          "Actors", 
                          "Year",
                          "Runtime..Minutes.",
                          "Rating", 
                          "Votes", 
                          "Revenue..Millions.", 
                          "Metascore"),
                    new=c("Rank", 
                          "Title",
                          "Genre",
                          "Description",
                          "Director", 
                          "Actors", 
                          "Year",
                          "Runtime",
                          "Rating", 
                          "Votes", 
                          "Revenue", 
                          "Metascore"))
WA <- MoviesL
WA %>% 
    tab_cells(Year, Runtime, Rating, Votes, Revenue, Metascore) %>%
    tab_cols(total(label = "#Total")) %>% 
    tab_stat_fun("Mean" = w_mean, 
                 "S.D." = w_sd, 
                 "min" = w_min, 
                 "max" = w_max,
                 "Var." = w_var, 
                 "median" = w_median,
                  method = list) %>%
    tab_pivot() %>% 
    htmlTable(., css.cell = c("width: 100px", 
                              rep("width: 120px", ncol(.) - 1)))
#Total

| Variable | Mean | S.D. | Min | Max | Var. | Median |
|---|---|---|---|---|---|---|
| Year | 2012.8 | 3.2 | 2006.0 | 2016.0 | 10.3 | 2014.0 |
| Runtime | 113.2 | 18.8 | 66.0 | 191.0 | 353.9 | 111.0 |
| Rating | 6.7 | 0.9 | 1.9 | 9.0 | 0.9 | 6.8 |
| Votes | 169808.3 | 188762.6 | 61.0 | 1791916.0 | 3.563134e+10 | 110799.0 |
| Revenue | 83.0 | 103.3 | 0.0 | 936.6 | 10661.3 | 48.0 |
| Metascore | 59.0 | 17.2 | 11.0 | 100.0 | 295.7 | 59.5 |

a1 <- ggplot(MoviesL, aes(x=Runtime))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=1,
                  colour="black", 
                 fill= "blue", 
                 alpha = 0.5)+
  theme_bw()
a <- ggplot(MoviesL, aes(x=Runtime))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=2,
                  colour="black", 
                 fill= "blue", 
                 alpha = 0.5)+
  theme_bw()

############
b1 <- ggplot(MoviesL, aes(x=Rating))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=1,
                  colour="black", 
                 fill= "red", 
                 alpha = 0.5)+
  theme_bw()
b <- ggplot(MoviesL, aes(x=Rating))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=0.25,
                  colour="black", 
                 fill= "red", 
                 alpha = 0.5)+
  theme_bw()

############
c1 <- ggplot(MoviesL, aes(x=Votes))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=1,
                  colour="black", 
                 fill= "green", 
                 alpha = 0.5)+
  theme_bw()

c <- ggplot(MoviesL, aes(x=Votes))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=30000,
                  colour="black", 
                 fill= "green", 
                 alpha = 0.5)+
  theme_bw()
###########
d1 <- ggplot(MoviesL, aes(x=Revenue))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=1,
                  colour="black", 
                 fill= "lightblue", 
                 alpha = 0.5)+
  theme_bw()
d <- ggplot(MoviesL, aes(x=Revenue))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=50,
                  colour="black", 
                 fill= "lightblue", 
                 alpha = 0.5)+
  theme_bw()
##########
e1 <- ggplot(MoviesL, aes(x=Metascore))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=1,
                  colour="black", 
                 fill= "darkred", 
                 alpha = 0.5)+
  theme_bw()
e <- ggplot(MoviesL, aes(x=Metascore))+ 
  geom_density()+
  geom_histogram(aes(y=..density..), 
                  binwidth=5,
                  colour="black", 
                 fill= "darkred", 
                 alpha = 0.5)+
  theme_bw()

ggarrange(a1, a, b1, b, c1, c, d1, d, e1, e, 
          ncol = 2, 
          nrow = 5)
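
The paired plots above differ only in `binwidth`. As a rough illustration of what that parameter does to the underlying counts, here is a minimal base R sketch. It uses simulated runtimes (an assumption, not the IMDb file) and `hist()` rather than ggplot2, so it only mimics the binning step.

```r
# Simulated runtimes in minutes (hypothetical data, clipped to 66..191)
set.seed(1)
runtime <- round(pmin(pmax(rnorm(1000, mean = 113, sd = 19), 66), 191))

# Bin the same data with a width of 1 and a width of 2
h1 <- hist(runtime, breaks = seq(60, 200, by = 1), plot = FALSE)
h2 <- hist(runtime, breaks = seq(60, 200, by = 2), plot = FALSE)

length(h1$counts)   # number of 1-minute bins
length(h2$counts)   # number of 2-minute bins (half as many)

# Each 2-minute bin is the sum of two adjacent 1-minute bins,
# so wider bins give taller, smoother bars but hide fine detail.
all(h2$counts == h1$counts[c(TRUE, FALSE)] + h1$counts[c(FALSE, TRUE)])
```

Every observation is still counted exactly once in either version; only the grouping changes.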

1) What are your observations about the distributions of the variables above?

None of the distributions above looks normal; each is skewed slightly to the right or to the left. Votes and Revenue are strongly right-skewed: most movies have relatively few votes and low revenue, while a small number have very large values. This pulls the mean above the median (for Votes, 169,808.3 versus 110,799.0; for Revenue, 83.0 versus 48.0), so the mean is not a good summary of a typical movie for these variables.
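
To make the link between right skew and the mean concrete, here is a small sketch on simulated data. The log-normal sample and its parameters are assumptions chosen only to produce a long right tail; they are not fitted to the Votes column.

```r
# Hypothetical right-skewed sample (log-normal), standing in for a column like Votes
set.seed(42)
votes <- rlnorm(1000, meanlog = 11, sdlog = 1.2)

m   <- mean(votes)
med <- median(votes)

# Moment-based sample skewness: positive values indicate a long right tail
skew <- mean((votes - m)^3) / sd(votes)^3

c(mean = m, median = med, skewness = skew)
```

When the skewness is clearly positive, the mean sits above the median, which is the same pattern the summary table shows for Votes and Revenue.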

2) What are the ranges?

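The ranges of the data above (minimum to maximum, taken from the summary table) are as follows: Year 2006 to 2016; Runtime 66 to 191 minutes; Rating 1.9 to 9.0; Votes 61 to 1,791,916; Revenue 0.0 to 936.6 million; Metascore 11 to 100.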
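
Ranges, and a first pass at outliers, can also be computed directly. The sketch below uses a tiny made-up data frame (`toy` is hypothetical) and the common 1.5 × IQR rule, which is one convention among several rather than the method used elsewhere in this report.

```r
# Hypothetical toy data standing in for two columns of MoviesL
toy <- data.frame(
  Runtime = c(66, 95, 100, 111, 123, 130, 191),
  Rating  = c(1.9, 6.2, 6.5, 6.8, 7.4, 7.9, 9.0)
)

# Minimum and maximum of every column
ranges <- sapply(toy, range, na.rm = TRUE)
rownames(ranges) <- c("min", "max")
ranges

# Flag values more than 1.5 * IQR outside the quartiles
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

iqr_outliers(toy$Runtime)
```

On the real data the same two calls would be applied to `MoviesL` columns such as `Runtime` or `Revenue`.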

3) Are there any outliers?

4) Are there any films that are “greatly” different than others?

5) What happens when you change the bin size from 1 to 2? What differences do you see?

# Flag each movie by whether its Genre string mentions Comedy or Action
MoviesL$Comedy <- ifelse(grepl("Comedy", MoviesL$Genre, ignore.case = TRUE), "Comedy","Not Comedy")
MoviesL$Action <- ifelse(grepl("Action", MoviesL$Genre, ignore.case = TRUE), "Action","Not Action")


A1 <-ggplot(MoviesL,aes(x=Action, y=Runtime, color = Action)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
A2 <-ggplot(MoviesL,aes(x=Action, y=Rating, color = Action)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
A3 <-ggplot(MoviesL,aes(x=Action, y=Votes, color = Action)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
A4 <-ggplot(MoviesL,aes(x=Action, y=Revenue, color = Action)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
A5 <-ggplot(MoviesL,aes(x=Action, y=Metascore, color = Action)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())

C1 <-ggplot(MoviesL,aes(x=Comedy, y=Runtime, color = Comedy)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
C2 <-ggplot(MoviesL,aes(x=Comedy, y=Rating, color = Comedy)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
C3 <-ggplot(MoviesL,aes(x=Comedy, y=Votes, color = Comedy)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
C4 <-ggplot(MoviesL,aes(x=Comedy, y=Revenue, color = Comedy)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())
C5 <-ggplot(MoviesL,aes(x=Comedy, y=Metascore, color = Comedy)) + 
    geom_boxplot() + 
    geom_jitter(width=0.25, alpha=0.05)+
    theme(legend.position = "none",
          axis.text.y=element_blank(),
          axis.title.x=element_blank())



ggarrange(A1, A2, A3, A4, A5, C1, C2, C3, C4, C5, 
          ncol = 5, 
          nrow = 2 )

Add brief comments to your code and explain what your code is doing.

These box plots compare each numeric variable across genre categories: the top row contrasts Action with Not Action movies, and the bottom row contrasts Comedy with Not Comedy movies. This lets us see whether the variables differ by genre.

The first variable is Runtime. Action movies tend to run longer than Not Action movies, whereas Comedy movies tend to run shorter than Not Comedy movies. For the second variable, Rating, there is no clear difference between the groups. For Votes, Action movies tend to receive more votes than Not Action movies, and Comedy movies tend to receive slightly fewer votes than Not Comedy movies, which was very surprising to me.

The fourth variable, Revenue, shows no distinction between Comedy and Not Comedy, but Action movies tend to have distinctly higher revenue than Not Action movies. Although there are a fair number of outliers in the dataset, the middle 50% of Action movies still sits noticeably higher than that of Not Action movies. The last variable is Metascore. Once again there is no difference between Comedy and Not Comedy movies, while Not Action movies tend to have a higher overall Metascore than Action movies. I think this is because Action movies are not for everyone, as they can be quite violent. Overall, this analysis helped show the effect of genre on several movie variables.
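
The visual comparison can be backed up with group medians. The sketch below is illustrative only: `toy` is a made-up data frame with invented revenues, and the flag is built the same way as the `Action` column in the code above.

```r
# Hypothetical toy data: a Genre string and a Revenue figure per movie
toy <- data.frame(
  Genre   = c("Action,Adventure", "Comedy", "Action,Sci-Fi",
              "Drama", "Comedy,Romance", "Action,Comedy"),
  Revenue = c(300, 40, 250, 20, 60, 120)
)

# Same flagging idea as in the report: does Genre mention "Action"?
toy$Action <- ifelse(grepl("Action", toy$Genre, ignore.case = TRUE),
                     "Action", "Not Action")

# Median revenue per group (the thick line inside each box plot)
med <- tapply(toy$Revenue, toy$Action, median)
med
```

Medians are used here rather than means because the box plots themselves are drawn around the median and quartiles.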

Part III: Bivariate Statistics

ggcorr(MoviesL, 
       label = TRUE,
       label_alpha = TRUE,
       layout.exp = 0,
       size = 3,
       label_size = 4,
       angle = -45)+
       theme(plot.margin = unit(c(0,0,0,0), "cm"))

What do you see at the end? Provide your comments about the heatmap. Which variables seem to be closely related?

The heat map shows, in both numbers and colors, how strongly each pair of variables is correlated. A correlation close to 1 indicates a strong positive relationship, a correlation close to -1 indicates a strong negative relationship, and a correlation close to 0 indicates little or no linear relationship. Correlation describes association only; it does not show that one variable affects another.

One pair I found interesting is Year and Votes, which are negatively correlated. The correlation is not strong, but it is more negative than that of any other pair. One possible explanation is that older movies have had more time to accumulate votes than recent releases.

The other pairs I wanted to investigate are all positively correlated. Rating and Metascore have a correlation of 0.6, meaning that higher-rated movies also tend to have a higher Metascore, as expected. The correlation between Revenue and Votes is equally high at 0.6, which also seems reasonable: movies with more votes tend to have higher revenue. Movies that audiences like tend to be passed on and viewed more than movies that are not. Movies with a high rating also tended to get more votes, which again seems like a reasonable deduction from our data. Finally, Revenue was not correlated with Metascore at all, which was shocking to me.
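
For readers who want to see what a coefficient of about 0.6 looks like numerically, here is a minimal sketch with simulated data. The variable names echo the dataset, but the values and the linear relationship between them are invented for illustration.

```r
# Simulated data (hypothetical): metascore depends partly on rating, noise does not
set.seed(7)
n <- 200
rating    <- rnorm(n, mean = 6.7, sd = 0.9)
metascore <- 59 + 11 * (rating - 6.7) + rnorm(n, mean = 0, sd = 14)
noise     <- rnorm(n)

r_related   <- cor(rating, metascore)  # moderate positive correlation
r_unrelated <- cor(rating, noise)      # close to zero

round(c(related = r_related, unrelated = r_unrelated), 2)

# Full correlation matrix, as a heat map would display it;
# pairwise.complete.obs skips missing values pair by pair
round(cor(data.frame(rating, metascore, noise),
          use = "pairwise.complete.obs"), 2)
```

The `use = "pairwise.complete.obs"` argument matters on data with missing values, since `cor()` otherwise returns `NA` for any pair that contains one.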

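# Columns 7-12 are the numeric ones: Year, Runtime, Rating, Votes, Revenue, Metascore
Movies1 <- Movies [, c(7, 8, 9, 10, 11,12)]
# Scatterplot matrix with histograms on the diagonal and correlations above it
chart.Correlation(Movies1, histogram = TRUE)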

1) Which variables seem to have a linear relationship?

2) Which variables seem to have a non-linear relationship?

3) Which variables seem to have no relationship?

Part IV: Distributions

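# Flag every movie whose cast list mentions Jennifer Garner
MoviesL$JenniferGarner <- ifelse(grepl("Jennifer Garner", MoviesL$Actors , ignore.case = TRUE), "With Jennifer Garner","Without Jennifer Garner")

# Subset to her movies; the flag holds the labels above, not "Yes"/"No"
Jenn_df <- MoviesL[MoviesL$JenniferGarner == "With Jennifer Garner",]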

WB <- MoviesL 

WB %>% 
    tab_cells(Year, Runtime, Rating, Votes, Revenue, Metascore) %>%
    tab_cols(JenniferGarner) %>% 
    tab_stat_fun("Mean" = w_mean, 
                 "S.D." = w_sd, 
                 "min" = w_min, 
                 "max" = w_max,
                 "Var." = w_var, 
                 "median" = w_median,
                  method = list) %>%
    tab_pivot() %>%
htmlTable(., css.cell = c("width: 100px", 
                              rep("width: 70px", ncol(.) - 1)))
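JenniferGarner

| Variable | Group | Mean | S.D. | Min | Max | Var. | Median |
|---|---|---|---|---|---|---|---|
| Year | With Jennifer Garner | 2013.7 | 3.5 | 2007.0 | 2016.0 | 12.3 | 2015.0 |
| Year | Without Jennifer Garner | 2012.8 | 3.2 | 2006.0 | 2016.0 | 10.3 | 2014.0 |
| Runtime | With Jennifer Garner | 99.3 | 13.8 | 81.0 | 117.0 | 189.9 | 101.0 |
| Runtime | Without Jennifer Garner | 113.3 | 18.8 | 66.0 | 191.0 | 353.9 | 111.0 |
| Rating | With Jennifer Garner | 6.9 | 1.0 | 5.3 | 8.0 | 1.0 | 7.2 |
| Rating | Without Jennifer Garner | 6.7 | 0.9 | 1.9 | 9.0 | 0.9 | 6.8 |
| Votes | With Jennifer Garner | 140391.0 | 197269.9 | 291.0 | 432461.0 | 3.891541e+10 | 22372.5 |
| Votes | Without Jennifer Garner | 169985.8 | 188800.0 | 61.0 | 1791916.0 | 3.564542e+10 | 111191.5 |
| Revenue | With Jennifer Garner | 53.2 | 51.1 | 0.0 | 143.5 | 2607.9 | 44.5 |
| Revenue | Without Jennifer Garner | 83.2 | 103.5 | 0.0 | 936.6 | 10714.0 | 48.0 |
| Metascore | With Jennifer Garner | 55.8 | 26.9 | 11.0 | 84.0 | 721.4 | 57.5 |
| Metascore | Without Jennifer Garner | 59.0 | 17.1 | 15.0 | 100.0 | 293.6 | 59.5 |
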
Runtime1<-pnorm(87, mean=113.3, sd = 18.8)*100
Runtime2<-pnorm(109, mean=113.3, sd = 18.8)*100
Runtime3<-pnorm(81, mean=113.3, sd = 18.8)*100
Runtime4<-pnorm(106, mean=113.3, sd = 18.8)*100
Runtime5<-pnorm(96, mean=113.3, sd = 18.8)*100
Runtime6<-pnorm(117, mean=113.3, sd = 18.8)*100

Year1<-pnorm(2016, mean=2012.8, sd = 3.2)*100
Year2<-pnorm(2016, mean=2012.8, sd = 3.2)*100
Year3<-pnorm(2014, mean=2012.8, sd = 3.2)*100
Year4<-pnorm(2016, mean=2012.8, sd = 3.2)*100
Year5<-pnorm(2007, mean=2012.8, sd = 3.2)*100
Year6<-pnorm(2013, mean=2012.8, sd = 3.2)*100

Rating1<-pnorm(5.3, mean = 6.7, sd = 0.9)*100
Rating2<-pnorm(7.0, mean = 6.7, sd = 0.9)*100
Rating3<-pnorm(6.2, mean = 6.7, sd = 0.9)*100
Rating4<-pnorm(7.5, mean = 6.7, sd = 0.9)*100
Rating5<-pnorm(7.5, mean = 6.7, sd = 0.9)*100
Rating6<-pnorm(8.0, mean = 6.7, sd = 0.9)*100

Votes1<-pnorm(12435, mean = 169985.8, sd = 188800.0)*100
Votes2<-pnorm(12048, mean = 169985.8, sd = 188800.0)*100
Votes3<-pnorm(32310, mean = 169985.8, sd = 188800.0)*100
Votes4<-pnorm(291, mean = 169985.8, sd = 188800.0)*100
Votes5<-pnorm(432461, mean = 169985.8, sd = 188800.0)*100
Votes6<-pnorm(352801, mean = 169985.8, sd = 188800.0)*100

Revenue1<-pnorm(19.64, mean = 83.2, sd = 103.5)*100
Revenue2<-pnorm(61.69, mean = 83.2, sd = 103.5)*100
Revenue3<-pnorm(66.95, mean = 83.2, sd = 103.5)*100
Revenue4<-pnorm(0.01, mean = 83.2, sd = 103.5)*100
Revenue5<-pnorm(143.49, mean = 83.2, sd = 103.5)*100
Revenue6<-pnorm(27.30, mean = 83.2, sd = 103.5)*100

Metascore1<-pnorm(11, mean = 59.0, sd = 17.1)*100
Metascore2<-pnorm(44, mean = 59.0, sd = 17.1)*100
Metascore3<-pnorm(54, mean = 59.0, sd = 17.1)*100
Metascore4<-pnorm(61, mean = 59.0, sd = 17.1)*100
Metascore5<-pnorm(81, mean = 59.0, sd = 17.1)*100
Metascore6<-pnorm(84, mean = 59.0, sd = 17.1)*100
DT <- data.table(Movies = c("Nine Lives", "Miracles from Heaven", "Alexander and the Terrible, Horrible, No Good, Very Bad Day", "Wakefield", "Juno", "Dallas Buyers Club"),
                 Runtime = c(Runtime1,Runtime2, Runtime3, Runtime4, Runtime5, Runtime6),
                 Year = c(Year1,Year2, Year3, Year4, Year5, Year6),
                 Rating = c(Rating1, Rating2, Rating3, Rating4, Rating5, Rating6),
                 Votes = c(Votes1, Votes2, Votes3, Votes4, Votes5, Votes6),
                 Revenue = c(Revenue1, Revenue2, Revenue3, Revenue4, Revenue5, Revenue6),
                 Metascore = c(Metascore1, Metascore2, Metascore3, Metascore4, Metascore5, Metascore6))
                 


formattable(DT, 
            align = c("l", "c", "c", "c", "c", "c", "c"),  # one entry per column (7)
            list('Movies' = formatter(
              "span" , style = ~style(color = "gray", font.weight = "bold" )),
              'Runtime' = color_tile("#DeF7E9", "#71CA97"),
              'Year' = color_tile("#DeF7E9", "#71CA97"),
              'Rating' = color_tile("#DeF7E9", "#71CA97"),
              'Votes' = color_tile("#DeF7E9", "#71CA97"),
              'Revenue' = color_tile("#DeF7E9", "#71CA97"),
              'Metascore' = color_tile("#DeF7E9", "#71CA97")))
| Movies | Runtime | Year | Rating | Votes | Revenue | Metascore |
|---|---|---|---|---|---|---|
| Nine Lives | 8.091606 | 84.134475 | 5.990691 | 20.20038 | 26.95725 | 0.2500126 |
| Miracles from Heaven | 40.954195 | 84.134475 | 63.055866 | 20.14270 | 41.76824 | 19.0190914 |
| Alexander and the Terrible, Horrible, No Good, Very Bad Day | 4.289055 | 64.616977 | 28.925736 | 23.29351 | 43.76205 | 38.4991298 |
| Wakefield | 34.889781 | 84.134475 | 81.296860 | 18.43777 | 21.07655 | 54.6553754 |
| Juno | 17.873079 | 3.495449 | 81.296860 | 91.77703 | 71.98891 | 90.0874359 |
| Dallas Buyers Club | 57.801130 | 52.491767 | 92.569300 | 83.35529 | 29.45652 | 92.8127793 |
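
Each value in the table above is a percentile under a normal approximation: `pnorm(x, mean, sd) * 100`, using the mean and standard deviation of the movies without Jennifer Garner. The sketch below reproduces the Year column in vectorized form. The helper name `norm_percentile` is introduced here for illustration and does not appear in the original code.

```r
# Percentile of x under a normal distribution with the given mean and sd
norm_percentile <- function(x, mean, sd) {
  pnorm(x, mean = mean, sd = sd) * 100
}

# Release years of the six Jennifer Garner movies, in table order
years <- c(2016, 2016, 2014, 2016, 2007, 2013)
year_pct <- norm_percentile(years, mean = 2012.8, sd = 3.2)
round(year_pct, 2)

# The approximation is weaker for skewed variables: a revenue of 0.01
# is close to the dataset minimum, yet the normal model still places it
# around the 21st percentile.
rev_pct <- norm_percentile(0.01, mean = 83.2, sd = 103.5)
round(rev_pct, 2)
```

Because Votes and Revenue are strongly right-skewed, an empirical percentile (for example `ecdf(column)(x) * 100` on the observed column) would likely be a more faithful comparison for those two variables.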

1) What are some of your observations?

2) How do Jennifer Garner movies compare to the rest of the movies in the dataset?

Ptable <- as.data.frame(Metadata)



kable(Ptable, caption = "Movies with and without Jennifer Garner" ) %>% 
  kable_styling("striped") %>%
  pack_rows("Metascore", 1,2) %>% 
  pack_rows("Rank", 3,4) %>% 
  pack_rows("Rating", 5,6) %>%
  pack_rows("Revenue", 7,8) %>% 
  pack_rows("Runtime", 9,10) %>% 
  pack_rows("Votes", 11,12) %>%
  pack_rows("Year", 13,14)
Movies with and without Jennifer Garner

| Variable | Group | Mean | Std.Dev | Min | Q1 | Median | Q3 | Max | MAD | IQR | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Metascore | Without | 59.00538 | 17.13541 | 15.00 | 47.00 | 59.500 | 72.00 | 100.00 | 18.53250 | 25.000 | 0.2904042 |
| Metascore | With | 55.83333 | 26.85827 | 11.00 | 44.00 | 57.500 | 81.00 | 84.00 | 27.42810 | 29.500 | 0.4810436 |
| Rank | Without | 499.6841 | 288.2422 | 1.00 | 250.00 | 499.500 | 749.00 | 999.00 | 369.90870 | 498.500 | 0.5768489 |
| Rank | With | 635.6667 | 379.7292 | 69.00 | 333.00 | 715.500 | 981.00 | 1000.00 | 407.71500 | 559.750 | 0.5973716 |
| Rating | Without | 6.722032 | 0.9455080 | 1.90 | 6.20 | 6.800 | 7.40 | 9.00 | 0.88956 | 1.200 | 0.1406581 |
| Rating | With | 6.916667 | 0.9988327 | 5.30 | 6.20 | 7.250 | 7.50 | 8.00 | 0.74130 | 1.100 | 0.1444095 |
| Revenue | Without | 83.16268 | 103.5083 | 0.00 | 13.18 | 47.985 | 113.73 | 936.63 | 61.34999 | 100.515 | 1.2446490 |
| Revenue | With | 53.18000 | 51.06796 | 0.01 | 19.64 | 44.495 | 66.95 | 143.49 | 35.07090 | 44.080 | 0.9602851 |
| Runtime | Without | 113.2555 | 18.81138 | 66.00 | 100.00 | 111.000 | 123.00 | 191.00 | 17.79120 | 23.000 | 0.1660968 |
| Runtime | With | 99.33333 | 13.77921 | 81.00 | 87.00 | 101.000 | 109.00 | 117.00 | 16.30860 | 19.000 | 0.1387169 |
| Votes | Without | 169985.8 | 188800.0 | 61.00 | 37033.00 | 111191.500 | 239772.00 | 1791916.00 | 130363.50000 | 202486.500 | 1.1106810 |
| Votes | With | 140391.0 | 197269.9 | 291.00 | 12048.00 | 22372.500 | 352801.00 | 432461.00 | 24022.57000 | 260533.500 | 1.4051460 |
| Year | Without | 2012.778 | 3.205274 | 2006.00 | 2010.00 | 2014.000 | 2016.00 | 2016.00 | 2.96520 | 6.000 | 0.0015925 |
| Year | With | 2013.667 | 3.502380 | 2007.00 | 2013.00 | 2015.000 | 2016.00 | 2016.00 | 1.48260 | 2.750 | 0.0017393 |

Part V: Running Linear Regression

model1 <- lm(Revenue ~ Rating + Metascore, data=MoviesL)

summ(model1)
Observations: 838 (162 missing obs. deleted)
Dependent variable: Revenue
Type: OLS linear regression
F(2,835) = 20.66
R² = 0.05
Adj. R² = 0.04

| Term | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | -92.79 | 28.81 | -3.22 | 0.00 |
| Rating | 26.39 | 5.44 | 4.85 | 0.00 |
| Metascore | -0.04 | 0.28 | -0.15 | 0.88 |

Standard errors: OLS

Are the estimates significant? Look at the signif. codes: reference in the output. Report, if the estimates have any “stars”.
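
The `summ()` output rounds p-values to two decimals and does not print significance stars. As a rough check, the p-values and the usual R star codes can be recomputed from the reported estimates and standard errors. This sketch works from the rounded numbers in the table, so small differences from the reported t values are expected.

```r
# Estimates and standard errors as reported in the regression table
est <- c("(Intercept)" = -92.79, Rating = 26.39, Metascore = -0.04)
se  <- c(28.81, 5.44, 0.28)
df  <- 835  # residual degrees of freedom, from F(2,835)

t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df)  # two-sided p-value

# Standard R significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stars <- cut(p_val,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

data.frame(estimate = est,
           t = round(t_val, 2),
           p = signif(p_val, 2),
           stars = stars)
```

Under these numbers the intercept and Rating are significant at conventional levels, while Metascore is not.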

plotly::plot_ly(data = MoviesL, 
        z = ~Revenue, 
        x = ~Metascore, 
        y = ~Rating, 
        color = ~JenniferGarner, 
        colors = c('blue' ,'red'),opacity = 0.5) %>%
  plotly::add_markers( marker = list(size = 3))
# keep only the two columns this model needs
MoviesL1 <- MoviesL %>% select(Rating, Revenue)

# fit the model; na.exclude pads predictions and residuals with NA for dropped rows
fit <- lm(Revenue ~ Rating, data = MoviesL1, na.action = na.exclude)

# find r from R^2 (positive root, because the fitted slope is positive)
R21 <- summary(fit)$r.squared
R1 <- round(sqrt(R21), digits = 3) 


#obtain predicted and residual values 
MoviesL1$predicted <- predict(fit)
MoviesL1$residuals <- resid(fit)

#create the plot 
Graph1<-ggplot(MoviesL1, aes(x=Rating, y = Revenue))+
  geom_segment(aes(xend= Rating, yend = predicted), alpha = .1)+
  scale_y_continuous(labels = scales::comma)+
  geom_point(aes(color = residuals), alpha = 1.0)+
  scale_color_gradient2(low = "blue", mid = "lightgray", high = "red")+
  guides(color = "none")+
  ylab("Revenue")+
  xlab("Rating")+
  geom_point(aes(y=predicted), shape = 1, alpha = .5)+
  theme_bw(base_size = 8)+
  theme(plot.title = element_text(size = 8))+
  geom_smooth(method = lm, color = "gray")+ 
  ggtitle(paste(" R2 = ",signif(summary(lm(Revenue ~ Rating, data = MoviesL1))$r.squared, 5),
                "\n R = ", R1,
                "\n Intercept =",signif(lm(Revenue ~ Rating, data = MoviesL1)$coef[[1]],5 ),
                "\n Slope =",signif(lm(Revenue ~ Rating, data = MoviesL1)$coef[[2]], 5)))+ 
  annotate("text", x =2.2, y = 750, label = "italic(hat(y)) ", parse = TRUE, size = 3) +
  annotate("text", x = 3.6, y = 750, label = "= -90.738 + 25.49x", size = 3) 
# keep only the two columns this model needs
MoviesL2 <- MoviesL %>% select(Metascore, Revenue)

# fit the model 
fit2 <- lm(Revenue ~ Metascore, data = MoviesL2, na.action = na.exclude)

# find r from R^2 (positive root, because the fitted slope is positive)
R31 <- summary(fit2)$r.squared
R1 <- round(sqrt(R31), digits = 3) 


# obtain predicted and residual values from fit2 (the Metascore model), not fit
MoviesL2$predicted <- predict(fit2)
MoviesL2$residuals <- resid(fit2)

#create the plot 
Graph2 <- ggplot(MoviesL2, aes(x=Metascore, y = Revenue))+
  geom_segment(aes(xend= Metascore, yend = predicted), alpha = .1)+
  scale_y_continuous(labels = scales::comma)+
  geom_point(aes(color = residuals), alpha = 1.0)+
  scale_color_gradient2(low = "blue", mid = "lightgray", high = "red")+
  guides(color = "none")+
  ylab("Revenue")+
  xlab("Metascore")+
  geom_point(aes(y=predicted), shape = 1, alpha = .5)+
  theme_bw(base_size = 8)+
  theme(plot.title = element_text(size = 8))+
  geom_smooth(method = lm, color = "gray")+ 
  ggtitle(paste(" R2 = ",signif(summary(lm(Revenue ~ Metascore, data = MoviesL2))$r.squared, 5),
                "\n R = ", R1,
                "\n Intercept =",signif(lm(Revenue ~ Metascore, data = MoviesL2)$coef[[1]],5 ),
                "\n Slope =",signif(lm(Revenue ~ Metascore, data = MoviesL2)$coef[[2]], 5)))+ 
  annotate("text", x =15, y = 750, label = "italic(hat(y)) ", parse = TRUE, size = 3) +
  annotate("text", x = 32, y = 750, label = "= 32.26 + 0.87795x", size = 3) 
ggarrange(Graph1, Graph2,
          ncol = 2, 
          nrow = 1)

AV1 <- 936.63
AV2 <- 760.51
AV3 <- 652.51
AV4 <- 623.28
AV5 <- 533.32

PV1 <- -92.79 + 26.3*8.1 -0.041*81
PV2 <- -92.79 + 26.3*7.8 -0.041*83
PV3 <- -92.79 + 26.3*7.0 -0.041*59
PV4 <- -92.79 + 26.3*8.1 -0.041*69
PV5 <- -92.79 + 26.3*9.0 -0.041*82


SEV1 <- (AV1-PV1)^2
SEV2 <- (AV2-PV2)^2
SEV3 <- (AV3-PV3)^2
SEV4 <- (AV4-PV4)^2
SEV5 <- (AV5-PV5)^2


# Collect actual revenue, model prediction and squared error for the five movies
DT1 <- data.table(Movies = c("Star Wars: Episode VII - The Force Awakens", "Avatar", "Jurassic World", "The Avengers", "The Dark Knight"),
                 "Actual Value" = c(AV1, AV2, AV3, AV4, AV5),
                 "Predicted Value" = c(PV1, PV2, PV3, PV4, PV5),
                 "Squared Error Values" = c(SEV1, SEV2, SEV3, SEV4, SEV5))
                 
formattable(DT1, 
            align = c("l", "c", "c", "c"),  # one entry per column (4)
            list('Movies' = formatter(
              "span" , style = ~style(color = "gray", font.weight = "bold" )),
              'Squared Error Values' = color_tile("#DeF7E9", "#71CA97")))

| Movies | Actual Value | Predicted Value | Squared Error Values |
|---|---|---|---|
| Star Wars: Episode VII - The Force Awakens | 936.63 | 116.919 | 671926.1 |
| Avatar | 760.51 | 108.947 | 424534.3 |
| Jurassic World | 652.51 | 88.891 | 317666.4 |
| The Avengers | 623.28 | 117.411 | 255903.4 |
| The Dark Knight | 533.32 | 140.548 | 154269.8 |
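
The squared errors in the table can be computed in one vectorized step and summarized as a root mean squared error (RMSE), which is in the same units as Revenue (millions). The sketch below uses the actual and predicted values from the table. Note that these five movies include the dataset's maximum revenue (936.63), so they are extreme cases and the error here is not representative of the whole dataset.

```r
# Actual and predicted revenue (millions) for the five movies in the table
actual    <- c(936.63, 760.51, 652.51, 623.28, 533.32)
predicted <- c(116.919, 108.947, 88.891, 117.411, 140.548)

sq_err <- (actual - predicted)^2   # squared error per movie
rmse   <- sqrt(mean(sq_err))       # typical size of a miss, in millions

round(sq_err, 1)
round(rmse, 1)
```

An RMSE of several hundred million against predictions of roughly 90 to 140 million shows how far the linear model falls short for these blockbusters.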

What do you think about the predictions?

Do you think the model is working well? Or, not so well?

Part VI: Conclusion

Many variables come into consideration when looking at a movie's success or failure, and I think the dataset we have simply does not contain enough of them to assess the success or failure of Jennifer Garner’s movies. Although the numeric data points are interesting, further exploration of the budget and the other actors would be worthwhile, giving us greater insight into the movies and their casts and a more accurate picture of whether a movie was successful. Regardless, I think that our predictive model can do an adequate job of predicting the revenue of mainstream movies based on Metascore and Rating.